---
title: "Thyroid Disease Analysis"
output:
flexdashboard::flex_dashboard:
orientation: columns
vertical_layout: fill
source_code: embed
---
```{r setup, include=FALSE}
library(flexdashboard) #For dashboard creation
knitr::opts_chunk$set(echo = TRUE)
library(caret) #Useful functions
library(ggplot2) #Visualizations
library(corrplot)
library(MASS)
library(gridExtra)
library(rpart)
library(rpart.plot)
library("e1071")
library("caTools")
library("class")
```
```{r, include=FALSE}
data <- read.csv("C:/Users/seanj/projects/thyroid_dash/data/Thyroid_Diff.csv")
#Label encoding of categorical variables
label_encode <- function(x){
if(is.factor(x) || is.character(x)){
as.numeric(factor(x))
}else{
x
}
}
encoded_data <- as.data.frame(lapply(data, label_encode))
encoded_data <- as.data.frame(lapply(encoded_data, function(x) x - 1))
```
# Introduction
This dataset was obtained from Kaggle.com, where it was sourced from the UCI Machine Learning Repository.
The goal of this project is to analyse the effects of various factors on the recurrence of well-differentiated thyroid cancer. The factors are:
1. Age: The age of the patient at the time of diagnosis or treatment.
2. Gender: The gender of the patient (male or female).
3. Smoking: Whether the patient is a smoker or not.
4. Hx Smoking: Smoking history of the patient (e.g., whether they have ever smoked).
5. Hx Radiotherapy: History of radiotherapy treatment for any condition.
6. Thyroid Function: The status of thyroid function, possibly indicating if there are any abnormalities.
7. Physical Examination: Findings from a physical examination of the patient, which may include palpation of the thyroid gland and surrounding structures.
8. Adenopathy: Presence or absence of enlarged lymph nodes (adenopathy) in the neck region.
9. Pathology: Specific types of thyroid cancer as determined by pathology examination of biopsy samples.
10. Focality: Whether the cancer is unifocal (limited to one location) or multifocal (present in multiple locations).
11. Risk: The risk category of the cancer based on various factors, such as tumor size, extent of spread, and histological type.
12. T: Tumor classification based on its size and extent of invasion into nearby structures.
13. N: Nodal classification indicating the involvement of lymph nodes.
14. M: Metastasis classification indicating the presence or absence of distant metastases.
15. Stage: The overall stage of the cancer, typically determined by combining T, N, and M classifications.
16. Response: Response to treatment, indicating whether the cancer responded positively, negatively, or remained stable after treatment.
17. Recurred: Indicates whether the cancer has recurred after initial treatment.
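Most of these factors are categorical, so the setup chunk label-encodes them to zero-based integers before analysis. A minimal illustration of that mapping on a toy vector (made-up values, not the real data):

```{r}
# Toy illustration of the label encoding used in the setup chunk:
# factor() orders levels alphabetically ("F" before "M"), and the data
# prep then subtracts 1 to make the codes zero-based.
label_encode <- function(x){
  if(is.factor(x) || is.character(x)) as.numeric(factor(x)) else x
}
label_encode(c("M", "F", "F", "M")) - 1  # 1 0 0 1
```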
This report first presents a statistical analysis of the dataset, with the aim of identifying the variables that influence thyroid cancer recurrence, followed by a number of models developed to predict recurrence from those variables.
# Analysis of Correlations
## Column
```{r, echo=FALSE, out.width="100%"}
#Statistical Analysis
res <- cor(encoded_data)
corrplot(res, method='color')
```
## Column
The correlation matrix highlights the following promising variables:
1. T
2. N
3. Gender
4. Smoking
5. Age
Here are the p-values of the corresponding correlation tests, computed with the Pearson method:
```{r, echo=FALSE}
cor.test(encoded_data$T, encoded_data$Recurred)$p.value
cor.test(encoded_data$N, encoded_data$Recurred)$p.value
cor.test(encoded_data$Gender, encoded_data$Recurred)$p.value
cor.test(encoded_data$Smoking, encoded_data$Recurred)$p.value
cor.test(encoded_data$Age, encoded_data$Recurred)$p.value
```
# Statistical Analysis of Correlations
The following plots describe only those patients whose cancer recurred.
```{r, echo=FALSE}
recurred <- subset(encoded_data, Recurred == 1)
# Keep only the variables flagged as promising by the correlation analysis
correlated_df <- recurred[, c("Gender", "Smoking", "Age", "T", "N")]
```
## Column
```{r, echo=FALSE, out.width="100%"}
gender_counts <- table(correlated_df[["Gender"]])
names(gender_counts) <- c("F", "M")
plot1 <- barplot(gender_counts, main = "Gender")
smoking_counts <- table(correlated_df[["Smoking"]])
names(smoking_counts) <- c("No", "Yes")
plot2 <- barplot(smoking_counts, main = "Smoking")
```
## Column
```{r, echo=FALSE, out.width="100%"}
age_counts <- table(correlated_df[["Age"]])
plot3 <- barplot(age_counts, xlab="Age", ylab="Freq", main="Age")
```
## Column
```{r, echo=FALSE, out.width="100%"}
T_counts <- table(correlated_df[["T"]])
plot4 <- barplot(T_counts, main="T")
N_counts <- table(correlated_df[["N"]])
plot5 <- barplot(N_counts, main="N")
```
# Stepwise Regression
## Column
Here I use stepwise regression on the significant variables (T, N, Gender, Smoking, Age) to narrow the variable set further.
```{r, echo=FALSE}
full_model <- lm(Recurred ~ T + N + Gender + Smoking + Age, data = encoded_data)
step.model <- stepAIC(full_model, direction = "both", trace = 0)
srm <- lm(Recurred ~ T + N + Gender + Age, data = encoded_data)  # model selected by stepAIC
step.model$coefficients
summary(srm)
```
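As a sanity check of what `stepAIC` does, here is a hedged sketch on simulated data (made-up variables, not the thyroid data): a predictor unrelated to the response is typically dropped because removing it lowers the AIC.

```{r}
# Sketch: stepAIC on simulated data where y depends on x1 but not x2.
library(MASS)
set.seed(1)
sim <- data.frame(x1 = rnorm(100), x2 = rnorm(100))
sim$y <- 2 * sim$x1 + rnorm(100)  # x2 is pure noise
sim_fit <- stepAIC(lm(y ~ x1 + x2, data = sim), direction = "both", trace = 0)
formula(sim_fit)
```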
## Column {data-width=500}
The model chosen by the stepwise regression drops the Smoking variable and yields statistically significant coefficients. Despite this, the following plots show evidence of non-linearity, so the model is not entirely reliable.
```{r, out.width="50%", echo=FALSE}
# plot.lm() draws base graphics, so arrange the four diagnostic plots with
# par() rather than grid.arrange(), which expects grid objects
par(mfrow = c(2, 2))
plot(srm, which = 1:4)
```
# Tree Model
## Column
Here I fit a regression tree using the ANOVA method, then prune it at the complexity parameter that minimises the cross-validated error.
```{r, echo=FALSE}
fit <- rpart(Recurred ~ T + N + Gender + Smoking + Age, data=encoded_data, method="anova")
fit_cp <- printcp(fit)
optimal_cp <- fit_cp[which.min(fit_cp[,"xerror"]),"CP"]
pruned_fit <- prune(fit, cp = optimal_cp)
rpart.plot(pruned_fit)
```
## Column
The following is an analysis of the tree model. Note that these metrics are computed on the same data used to fit the tree, so they may be optimistic.
```{r, echo=FALSE}
pred <- predict(pruned_fit, encoded_data)
mse <- mean((encoded_data$Recurred - pred)^2)
rsq <- 1 - sum((encoded_data$Recurred - pred)^2) / sum((encoded_data$Recurred - mean(encoded_data$Recurred))^2)
cat("MSE: ", mse, "\nR-squared:", rsq, "\n")
```
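For reference, the R-squared above is computed as 1 − SSres/SStot; a toy check of that formula with made-up numbers:

```{r}
# Toy check of the R-squared formula used above (made-up values):
y    <- c(0, 0, 1, 1)
yhat <- c(0.1, 0.2, 0.8, 0.9)
1 - sum((y - yhat)^2) / sum((y - mean(y))^2)  # 0.9
```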
```{r, echo=FALSE}
par(mfrow = c(2, 2))
resid_tree <- encoded_data$Recurred - pred
# Residuals vs Fitted
plot(pred, resid_tree, main = "Residuals vs Fitted", xlab = "Fitted", ylab = "Residuals")
abline(h = 0, col = "red")
# Q-Q Plot of Residuals
qqnorm(resid_tree)
qqline(resid_tree, col = "red")
# Scale-Location Plot
plot(pred, sqrt(abs(resid_tree)), main = "Scale-Location", xlab = "Fitted",
     ylab = "sqrt(|Residuals|)")
abline(h = 0, col = "red")
# Cook's Distance (from a linear model on all variables, as a rough influence check;
# Cook's distance is not defined for trees)
cooksd <- cooks.distance(lm(Recurred ~ ., data = encoded_data))
plot(cooksd, main = "Cook's Distance", ylab = "Cook's distance")
abline(h = 4 / length(encoded_data$Recurred), col = "red")
```
# kNN Classifier
## Column
### kNN selection
The following shows the accuracy of a kNN classifier trained on all of the predictor variables, for the k values 1, 3, 5, 7, 9, 15, 19, 25, and 50.
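Before applying kNN to the full dataset, here is a minimal toy example of how `class::knn` classifies by nearest neighbours (made-up one-dimensional data, not the thyroid data):

```{r}
# Toy class::knn example: test points near the 0-cluster are labelled 0,
# and those near the 1-cluster are labelled 1.
library(class)
train_toy  <- data.frame(x = c(0, 1, 2, 10, 11, 12))
labels_toy <- factor(c(0, 0, 0, 1, 1, 1))
test_toy   <- data.frame(x = c(1.5, 10.5))
knn(train = train_toy, test = test_toy, cl = labels_toy, k = 1)  # 0 1
```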
```{r, echo=FALSE}
set.seed(42)  # hypothetical seed, added here for a reproducible split
split <- sample.split(encoded_data$Recurred, SplitRatio = 0.7)  # split on the outcome vector
train_cl <- subset(encoded_data, split)
test_cl <- subset(encoded_data, !split)
# Use every column except the response as features
# (note: the features are label-encoded but not standardized)
train_scale <- train_cl[, names(train_cl) != "Recurred"]
test_scale <- test_cl[, names(test_cl) != "Recurred"]
k_values <- c(1,3,5,7,9,15,19,25,50)
accuracy_values <- sapply(k_values, function(k){
classifier_knn <- knn(train = train_scale,
test = test_scale,
cl = train_cl$Recurred,
k=k)
1-mean(classifier_knn != test_cl$Recurred)
})
accuracy_data <- data.frame(K = k_values, Accuracy=accuracy_values)
ggplot(accuracy_data, aes(x = K, y = Accuracy)) +
geom_line(color = "lightblue", size = 1) +
geom_point(color = "lightgreen", size = 3) +
labs(title = "Model Accuracy for Different K Values",
x = "Number of Neighbors (K)",
y = "Accuracy") +
theme_minimal()
```
## Column
### k = 1
This shows that a model using k = 1 is the most accurate. The following is further analysis of that model.
```{r, echo=FALSE}
classifier_knn <- knn(train = train_scale,
test = test_scale,
cl = train_cl$Recurred,
k=1)
acc <- 1-mean(classifier_knn != test_cl$Recurred)
cm <- table(test_cl$Recurred, classifier_knn)
print(paste("Accuracy: ", acc))
plot(classifier_knn, col=rainbow(2), main="Classification of Recurrence", xlab="Recurrence (0=No, 1=Yes)")
```
# Conclusion
Overall, the models above provided interesting insight into which factors affect the recurrence of thyroid cancer, with the kNN classifier proving to be the most accurate.
Author: Sean Theisen